A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework
Author
Abstract
The Map/Reduce framework is a programming model introduced by Google Inc. to support distributed computing on very large datasets across a large number of machines. It provides a simple yet powerful way to implement distributed applications without requiring deep knowledge of parallel programming. Each participating node executes Map and/or Reduce tasks, which involve reading and writing large datasets. In this work, we used the open-source Hadoop implementation of the Map/Reduce framework and developed a theoretical cost model to evaluate the I/O cost incurred on each node during the execution of a task. We then used the model to evaluate and compare two basic join algorithms already provided by the framework (the reduce-side join and the map-side join) and a third one that we implemented, which applies a Bloom filter in the map tasks. The experimental results confirmed the validity of our cost model and showed that our proposed algorithm incurs the lowest I/O cost of the three. This work is expected to provide insight into the Map/Reduce framework in terms of I/O cost, as well as a thorough analysis of three join implementations under the Map/Reduce dataflow.
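The idea behind the Bloom-filter join can be illustrated with a small in-memory simulation. This is a minimal sketch and not the authors' Hadoop implementation: the class `BloomFilter`, the functions `map_phase`, `reduce_phase`, `bloom_join`, and `plain_join`, and the filter sizes are all hypothetical names and values chosen for illustration. It assumes the filter is built from the keys of the smaller relation and applied in the map phase to the larger one, so that fewer records reach the shuffle; the count of shuffled records stands in for I/O cost.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Fixed-size Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def map_phase(records, tag, bloom=None):
    """Emit (key, (tag, value)); drop records whose key the filter rules out."""
    for key, value in records:
        if bloom is None or bloom.might_contain(key):
            yield key, (tag, value)


def reduce_phase(shuffled):
    """Group by key and emit the cross product of left and right values."""
    groups = defaultdict(lambda: ([], []))
    for key, (tag, value) in shuffled:
        groups[key][0 if tag == "L" else 1].append(value)
    joined = []
    for key in sorted(groups):
        left, right = groups[key]
        joined.extend((key, l, r) for l in left for r in right)
    return joined


def bloom_join(small, large):
    """Reduce-side join with a Bloom filter applied to the larger input."""
    bloom = BloomFilter()
    for key, _ in small:
        bloom.add(key)
    shuffled = list(map_phase(small, "L")) + list(map_phase(large, "R", bloom))
    return reduce_phase(shuffled), len(shuffled)


def plain_join(small, large):
    """Reduce-side join that shuffles every record of both inputs."""
    shuffled = list(map_phase(small, "L")) + list(map_phase(large, "R"))
    return reduce_phase(shuffled), len(shuffled)
```

Calling both functions on the same inputs, for example a two-row relation and a hundred-row relation sharing two keys, should return identical join results, while `bloom_join` reports no more shuffled records than `plain_join`. A Bloom filter can produce false positives but never false negatives, so the filtered join may shuffle a few extra records without losing any matches.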
Similar resources
Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework
The Map/Reduce framework, a parallel processing paradigm, is widely used for large-scale distributed data processing. Map/Reduce can perform typical relational database operations such as selection, aggregation, and projection. However, binary relational operators such as join, Cartesian product, and set operations are difficult to implement with Map/Reduce. Map/Reduce can process homogeneous ...
Sentiment Analysis of Social Networking Data Using Categorized Dictionary
Sentiment analysis is the process of analyzing a person’s perception or belief about a particular subject matter. However, finding the correct opinion or interest in multi-faceted sentiment data is a tedious task. In this paper, a method is proposed to improve sentiment accuracy by using a categorized dictionary for sentiment classification and analysis. A categorized dictiona...
Adaptive Join Plan Generation in Hadoop For CPS296.1 Course Project
Joins in Hadoop have always been a problem for users: the Map/Reduce framework seems to be designed specifically for group-by aggregation tasks rather than cross-table operations. At the same time, the join operation in distributed database systems was never an easy task, because data location and skew make join strategies harder to optimize. Fragment-replicate join (map join) may be a cle...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The rack-aware data placement strategy of the current Hadoop distributed file system assumes a homogeneous cluster, in which each node has the same computing capacity and is assigned the same workload. Default Hadoop d...
Performance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming - A Comparative Analysis
Hadoop Distributed File System (HDFS) is quite popular in the big data world. It not only provides a framework for storing data in a distributed environment, but also has a set of tools to retrieve and process these data using the map-reduce concept. This paper discusses the results of an evaluation of major tools such as Hive, Pig, and Hadoop streaming for solving problems from a relational perspective an...